Abstract - In this work we solve the day-ahead unit commitment
(UC) problem by formulating it as a Markov decision
process (MDP) and finding a low-cost policy for generation
scheduling. We present two reinforcement learning algorithms
and devise a third one. We compare our results to previous
work that uses simulated annealing (SA), and show a 27%
improvement in operation costs, with a running time of 2.5 minutes
(compared to 2.5 hours for the existing state of the art).

Index Terms - Power generation dispatch, learning (artificial
intelligence), optimal scheduling, optimization methods.


I. Introduction 


Unit commitment (UC) is the process of determining the
most cost-effective combination of generating units and their
generation levels within a power system to meet forecasted
load and reserve requirements, while adhering to generator
and transmission constraints [?]. This is a non-linear,
mixed-integer combinatorial optimization problem [?]. Low-cost
solutions to this problem directly translate into low
production costs for power utilities. As the size of the problem
increases, it becomes very complex and hard to solve [?].

Multiple optimization approaches have been applied over the
past years, such as priority ordering methods [?], [?],
dynamic programming [?], Lagrangian relaxation [13], the
branch-and-bound method [15], and integer and mixed-integer
programming [?], [?]. Other, more recent methods
come from the field of artificial intelligence, such as expert
systems [?], neural networks [?], fuzzy logic [?], genetic
algorithms [21], and simulated annealing [?].
Many of these approaches are either purely heuristic (e.g.
priority ordering) or semi-heuristic (e.g. simulated annealing),
and are thus often very sensitive to the choice of architecture,
manual parameter tuning, and different cost functions. On
the other hand, analytical methods can also introduce critical
shortcomings. The branch-and-bound algorithm, for instance,
suffers from an exponential growth in execution time with
the size of the UC problem [?], [15]. In addition, using
approximations to make it tractable for large-scale systems
causes solutions to be highly sub-optimal.


Therefore, in our work we take an analytical approach to
the problem, while ensuring that it becomes neither intractable
nor highly suboptimal in large-scale systems. We use a
Markov Decision Process (MDP) framework. MDPs are used
to describe numerous phenomena in many fields of science [?].
Such a model describes a decision-making process in which
outcomes are partly random and partly under the control of
the decision maker.

In this work we assume that the generation cost functions of
the different generators are known to the decision maker.
We note that this is often not the case for European system
operators, since in a competitive European electricity
market cost information is not available. However, in many
other cases this information is indeed available, such as for
some North American TSOs and for generation companies with
multiple generation units (such a company would not know
the characteristics of the power system, but this is not
problematic since they play no role in our formulation).
In addition, the UC problem can easily be extended to generate
production schedules in a competitive market environment [?].
Another paper presents a framework in which a traditional
cost-based unit commitment tool can be used to assist bidding
strategy decisions in a day-ahead electricity pool market [?].
In general, European TSOs can approximate generation costs
based on historical data (including past and present bids
they receive from generators) and market simulation. Also, in
future work, the uncertainty in these approximations can be
naturally expressed in our MDP model.

The rest of the paper is organized as follows. Section II
formulates the unit commitment problem. We then present
our MDP model for the UC problem in Section III, and give
an introduction to reinforcement learning in Section IV. The
algorithms we use are presented in Section V. Then, in Section
VI we show numerical tests of our methods. Lastly, in Section
VII we summarize our work.

II. Unit Commitment Problem Formulation

The problem is formulated as the following constrained 
optimization program. 




A. Objective 


The objective is to find a feasible plan with minimal cost 
for operating generators to meet client demands - 
Where:

$a_i(t) = 1$ when unit $i$ is turned on at time $t$, and $a_i(t) = 0$
otherwise.

$P_i(t)$ is the injected power [MW] in unit $i$ at time $t$.

$C_i(P)$ is the cost [\$] of injecting power $P$ in unit $i$.
$SC_i(t_{off_i})$ is the start-up cost [\$] of unit $i$ after it has been
off for a time period of $t_{off_i}$.
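
A form of the objective that is consistent with the symbol definitions above and with the per-step reward of Section III-C is sketched below. The horizon $T$ and the number of units $N$ are taken from the rest of the paper; the start-up indicator $a_i(t)(1 - a_i(t-1))$ is an assumption of this sketch, not a formula quoted from the paper.

```latex
\min_{a_i(t),\,P_i(t)} \; \sum_{t=1}^{T} \sum_{i=1}^{N}
\Big[ a_i(t)\,C_i\big(P_i(t)\big)
    + a_i(t)\,\big(1 - a_i(t-1)\big)\,SC_i\big(t_{off_i}\big) \Big]
```

The first term charges the generation cost of every unit that is on, and the second charges a start-up cost only in the period in which a unit switches from off to on.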

B. Constraints 


Any feasible solution is subject to the following constraints: 

• Load balance - total generation must equal the demand $D(t)$
at every time $t$.

• Generation limits -

$$\forall t, i:\quad a_i(t) P_{min_i} \le P_i(t) \le a_i(t) P_{max_i}. \tag{3}$$

• Set generation limits -

$$\forall t:\quad \sum_{i=1}^{N} a_i(t) P_{min_i} \le D(t), \qquad \sum_{i=1}^{N} a_i(t) P_{max_i} \ge D(t) + R(t).$$

• Minimum up/down time - a unit may be turned on (off) only
after it has been off (on) for at least $t_{off_i}$ ($t_{on_i}$) time periods.

Where:

$D(t)$ is the demand at time $t$.

$R(t)$ is the needed power reserve at time $t$.

$P_{min_i}$, $P_{max_i}$ are the minimal and maximal power injections
in unit $i$.

$t_{off_i}$, $t_{on_i}$ are the minimal time periods before turning unit $i$
on/off.



C. Costs

• Generation cost - a quadratic function of the power generated
by that unit:

$$C_i(P_i) = a_i P_i^2 + b_i P_i + c_i. \tag{6}$$

• Start-up cost - an exponentially dependent function of
the number of hours a unit has been down:

$$SC_i(t_{off_i}) = e_i \exp(-g_i t_{off_i}) + f_i \exp(-h_i t_{off_i}). \tag{7}$$

A graphical example of the generation (A) and start-up (B) costs
of a specific generator (with the parameters used in the
experiments section) is displayed in Figure 1.

Fig. 1. (A) shows the generation cost of a specific generator, and (B) shows
the start-up cost of that generator, as a function of the time it was off.

III. Markov Decision Process Approach

Finding a global optimum is intractable for this non-convex,
mixed-integer quadratic problem. Therefore, unlike
in [1], where a direct search in the solution space was
performed, we suggest an alternative approach: decomposing
the objective into a sequential decision-making process. We
use a Markov Decision Process 4-tuple $(S, A, P, R)$ [?] to
model the system's dynamics. Briefly, in this model at each
time step the process is in some state $s$, an action $a$ is
taken, the system transitions to a next state $s'$ according to
a transition kernel $P(s, a, s')$, and a reward $R(s, a, s')$ is
granted. Thus, the next state $s'$ depends only on the current
state $s$ and the decision maker's action $a$.

A. State-Space

The system's state can be fully described by the on/off time
of each of the $N$ generators and the time of the day (negative
values indicate off time):

$$S = \{-24, -23, \ldots, -1, 1, 2, \ldots, 24\}^N \times \{1, \ldots, 24\}.$$

B. Action Space

Each unit can be turned/kept on, or turned/kept off:

$$A = \{0, 1\}^N.$$

C. Reward

At each time step, the reward is (minus) the cost of operation
of the $N$ machines:

$$R(s, a, s') = -\sum_{i=1}^{N} \big[ I_{[s'_i > 0]} C_i(P_i) + I_{[s'_i > 0]} I_{[s_i < 0]} SC_i(s_i) \big].$$

The power injections $P_i$ are chosen by solving the appropriate
constrained quadratic program (the generation cost is quadratic).
By maximizing the undiscounted cumulative reward of the
MDP, we minimize the original objective.
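
The cost terms (6) and (7) and the per-step reward can be sketched in a few lines of Python. The parameter values below are made up for illustration (they are not the ones used in the experiments), the power injections are passed in rather than obtained from the quadratic program, and the off-time is taken as $|s_i|$, which is an assumption about how $SC_i(s_i)$ is evaluated.

```python
import math

def generation_cost(p, a, b, c):
    """Quadratic generation cost, as in (6)."""
    return a * p * p + b * p + c

def startup_cost(t_off, e, g, f, h):
    """Sum of two decaying exponentials in the off-time, as in (7)."""
    return e * math.exp(-g * t_off) + f * math.exp(-h * t_off)

def reward(s, s_next, powers, params):
    """Minus the operating cost of one step.

    s, s_next: per-unit on/off times (positive = on, negative = off).
    powers:    power injection per unit (assumed given here).
    params:    per-unit dicts with hypothetical keys a, b, c, e, g, f, h.
    """
    total = 0.0
    for si, si_next, p, q in zip(s, s_next, powers, params):
        if si_next > 0:      # unit is on after the transition
            total += generation_cost(p, q["a"], q["b"], q["c"])
            if si < 0:       # unit was off before, so it has just started up
                total += startup_cost(abs(si), q["e"], q["g"], q["f"], q["h"])
    return -total

# Illustrative parameters only.
unit = {"a": 0.01, "b": 10.0, "c": 100.0, "e": 50.0, "g": 0.1, "f": 20.0, "h": 0.5}
r = reward(s=(-3, 2), s_next=(1, 3), powers=(100.0, 50.0), params=(unit, unit))
```

Here the first unit pays both a generation and a start-up cost, while the second, which was already on, pays only the generation cost.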


D. Transition Kernel 

Transitions are deterministic: $f(s, a) = s'$. The transition
function restricts the process to satisfy the constraints of the
optimization problem.

A transition example is presented in Figure 2.








Fig. 2. Example for state transition - f(s, a) = s'. Generators are turned/kept 
on or off when an action of 1 or 0 is taken. Time is also represented in the 
state. 
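
A deterministic transition of this kind can be sketched as follows. The counter update rule, the saturation at ±24, and the way infeasible actions are rejected are assumptions made for illustration; the per-unit minimal up/down times `t_on` and `t_off` are hypothetical inputs, and the load and reserve constraints are not checked here.

```python
MAX_COUNT = 24  # on/off counters saturate at +/-24, matching the state-space

def step_unit(s_i, a_i, t_on, t_off):
    """Next on/off counter of one unit, or None if the action is infeasible."""
    if s_i > 0:                                  # unit is currently on
        if a_i == 1:
            return min(s_i + 1, MAX_COUNT)
        return -1 if s_i >= t_on else None       # may shut down only after t_on
    if a_i == 0:                                 # unit is off and stays off
        return max(s_i - 1, -MAX_COUNT)
    return 1 if -s_i >= t_off else None          # may start only after t_off

def transition(state, action, t_on, t_off):
    """f(s, a) for state = (unit counters, hour); None if infeasible.

    The hour is simply incremented and not wrapped around.
    """
    counters, hour = state
    nxt = [step_unit(s, a, u, d)
           for s, a, u, d in zip(counters, action, t_on, t_off)]
    if any(c is None for c in nxt):
        return None
    return (tuple(nxt), hour + 1)

s_next = transition(((3, -2), 1), (1, 1), t_on=(2, 2), t_off=(2, 2))
```

In this example the first unit stays on and its counter advances, while the second unit, off for two hours, is allowed to start.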

IV. Reinforcement Learning 

A policy is a mapping from the state-space $S$ to the
action-space $A$. Given a policy, we know what action to
perform at each state of the system.

For the defined MDP, our goal is the following: find an
optimal policy $\pi^*: S \to A$ s.t.

$$\pi^* = \arg\max_{\pi \in \Pi} \sum_{t=1}^{T} R\big(s_t, \pi(s_t), f(s_t, \pi(s_t))\big), \tag{8}$$

where $\Pi$ is the space of all possible policies.
Reinforcement Learning (RL) [8] is a field in Machine Learning
that studies algorithms that learn by interacting with the
environment. The learning is done for states and actions. For
each state $s$, given a policy $\pi$, a state value $v^{\pi}(s)$ is defined
as the cumulative reward collected from $s$ until the terminal
time $T$ when following $\pi$.
V. Reinforcement Learning Algorithmic Solutions

In this section we present three different reinforcement
learning algorithms for solving the MDP formulated above.

A. Algorithm 1 - Approximate Policy Iteration (API) using 
Classification 

An extension of the state value $v^{\pi}(s)$ defined above is the
state-action value function $Q^{\pi}(s, a)$, which denotes the value
of first performing action $a$ (regardless of the policy $\pi$) and only
then following policy $\pi$. In our first algorithm we use
the state-action value function. This function is defined over the
set of all $(s, a)$ pairs, which is of size $|S| \cdot |A|$.

1) Approximation: Our state-space grows exponentially
with $N$: $|S| = 24 \cdot 48^N$, as does the action-space: $|A| = 2^N$.
Already for $N > 4$, it is impossible to find the exact value
for each state-action pair. We therefore use an approximation
method for evaluating the state-action value function $Q^{\pi}(s, a)$.

We use feature-based regression, which significantly lowers
the dimension of $Q^{\pi}(\cdot, \cdot)$ from $|S| \cdot |A|$ to $\dim(\phi(s, a))$, the
dimension of the feature vector $\phi(s, a)$. We use 4 binary features
for each generator $i$, one for each of the possible 'interesting'
zones it can be in.

The features are then duplicated $N$ times and zeroed out
at indices where the action vector is 0. The result is again
duplicated into two, to distinguish $(s, a)$ pairs
that will lead to a catastrophe (a transition to infeasible states).
We end up with a feature vector $\phi(s, a)$ whose dimension is
only quadratic in $N$: $\dim(\phi(s, a)) = 2 \cdot 4 \cdot N^2$.
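
To give a sense of scale, the sizes quoted above can be evaluated directly. The snippet below simply plugs numbers into $|S| = 24 \cdot 48^N$, $|A| = 2^N$ and $\dim(\phi) = 2 \cdot 4 \cdot N^2$; the function names are ours.

```python
def state_space_size(n):
    """|S| = 24 * 48^N: 48 on/off counter values per unit, 24 hours."""
    return 24 * 48 ** n

def action_space_size(n):
    """|A| = 2^N: one on/off decision per unit."""
    return 2 ** n

def feature_dim(n):
    """dim(phi) = 2 * 4 * N^2."""
    return 2 * 4 * n ** 2

for n in (4, 8, 12):
    table_entries = state_space_size(n) * action_space_size(n)
    print(f"N={n}: |S||A| = {table_entries:.3e}, dim(phi) = {feature_dim(n)}")
```

For the $N = 12$ setting used in the experiments, the feature vector has 1152 entries, whereas a tabular representation would need one entry per state-action pair.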

2) Policy Iteration Algorithm: The basic algorithm is policy
iteration [?]. This well-known algorithm iterates between two
stages: evaluation of the states' values under a fixed policy, and
improvement of the policy using the learned values. We
perform the evaluation using the SARSA [?] algorithm (with
epsilon-greedy exploration), and the improvement is simply
done using the following maximization (for step $k$):

$$\pi_k(s) = \arg\max_{a \in A} Q_k(s, a) = \arg\max_{a \in A} \phi(s, a)^{\top} w_k. \tag{10}$$
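
The improvement step (10) amounts to scoring every action with the linear model and keeping the best one. A minimal sketch, assuming a hypothetical feature function `phi(s, a)` that returns a list of numbers and a weight vector `w` produced by the evaluation stage:

```python
from itertools import product

def greedy_action(s, w, phi, n_units):
    """Return the action in {0,1}^n_units maximizing phi(s, a) . w."""
    best_a, best_q = None, float("-inf")
    for a in product((0, 1), repeat=n_units):
        q = sum(f * wi for f, wi in zip(phi(s, a), w))
        if q > best_q:
            best_a, best_q = a, q
    return best_a

# Toy feature map (illustrative only): the action bits themselves.
def toy_phi(s, a):
    return list(a)

best = greedy_action(s=None, w=[1.0, -2.0, 0.5], phi=toy_phi, n_units=3)
```

Enumerating all $2^N$ actions in this way is only viable for small $N$, which is part of what makes the choice of policy representation difficult.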


3) Policy Representation: The biggest challenge in enabling
policy iteration for our problem is the choice of policy
representation. On the one hand, the policy should be defined
for all states $s \in S$. On the other hand, it can practically only
be trained using a very small fraction of this huge state-space.
Also, its output is a selection from the enormous space of
actions.

To handle the above difficulties, we chose the policy to be a
classifier that classifies states into actions. This gives rise to a
large-scale multi-class classification task ($2^N$ possible classes),
which is considered to be a difficult problem on its own. We
tackle it by using a hierarchical classifier with a tree-based
structure: each node classifies an action bit and splits into
two nodes for the next bit. Classification is done in the
feature-space. The perceptron algorithm [24] is used as the basic binary
classifier (online updates can be made to save memory). A
different tree is stored for each time-step.

Note that this is not a decision-tree classifier, but multiple
binary hyperplane-based classifiers that are traversed
in a sequential manner. The leaves determine the final
action prediction (they encapsulate the path).
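
A minimal sketch of such a hierarchical classifier, assuming a generic feature vector `x` for the state and a plain perceptron at each node. The class name, the lazy creation of nodes and the training interface are illustrative choices, not details taken from the paper (which additionally stores one tree per time-step).

```python
class BitTreePolicy:
    """One perceptron per action-bit prefix; predicts an action bit by bit."""

    def __init__(self, n_bits, n_features):
        self.n_bits = n_bits
        self.n_features = n_features
        self.weights = {}   # prefix tuple -> weight list, created lazily

    def _w(self, prefix):
        return self.weights.setdefault(prefix, [0.0] * self.n_features)

    def _bit(self, prefix, x):
        score = sum(wi * xi for wi, xi in zip(self._w(prefix), x))
        return 1 if score > 0 else 0

    def predict(self, x):
        """Walk down the tree: each node decides the next action bit."""
        prefix = ()
        for _ in range(self.n_bits):
            prefix += (self._bit(prefix, x),)
        return prefix

    def update(self, x, target):
        """Perceptron update at every node on the path of the target action."""
        for k in range(self.n_bits):
            prefix = tuple(target[:k])
            if self._bit(prefix, x) != target[k]:
                sign = 1.0 if target[k] == 1 else -1.0
                w = self._w(prefix)
                for j, xj in enumerate(x):
                    w[j] += sign * xj

policy = BitTreePolicy(n_bits=2, n_features=2)
policy.update([1.0, 0.0], (1, 0))
policy.update([0.0, 1.0], (0, 1))
```

Only the nodes along paths that are actually visited get allocated, so memory grows with the number of distinct action prefixes seen rather than with $2^N$.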







Algorithm 1 Approximate Policy Iteration using Classification
Initialize:
    $\alpha$ - SARSA step size
    $\epsilon$ - exploration parameter
    $N_{PI}$ - iteration count
    $\pi_0$ - initial policy
    $\phi$ - basis functions
for $k = 1$ to $N_{PI}$ do
    $w_k = SARSA(\pi_k, \alpha, \epsilon)$
    for all $s \in S$ do
        $a^* = \arg\max_{a \in A} \phi(s, a)^{\top} w_k$
        $\pi_k = updateClassifier(a^*, \pi_k)$
    end for
end for
return $\pi_{N_{PI}}$


B. Algorithm 2 - Tree Search 

An MDP can be represented as a tree, where each node
is a state and each edge corresponds to an action. In our
problem, we can theoretically express the tree explicitly, where
the edges include the exact reward of the transition, since
transitions and rewards are deterministic. Let us also denote
by $s_j^t$ the $j$-th state at time-step $t$. Under this representation,
finding an optimal policy $\pi^*$ corresponds to finding the path
with the largest aggregated reward from the root (initial state $s^0$).

Fig. 3. Visualisation of algorithm 2 - tree search. Nodes are states, edges
are transitions with the corresponding rewards.

However, since the number of possible paths in the tree is in
the order of $|A|^T = 2^{NT}$, naively searching the tree for an
optimal path is intractable for a problem of our size. Therefore,
in our tree search algorithm we limit the time horizon in which
we search to $H$ ($H < T$). That is, tree search seeks a lookahead
policy by iterating through all of the possible outcomes in a
limited lookahead horizon $H$. Our algorithm contains the
following two main components:

1) Algorithm 2.1: The first part of the algorithm,
$findBestAction(s^t, H)$, recursively searches for the next
best action that can be taken from state $s^t$, by iterating over
all possible actions from that state. "Best" in this case is
considered with regard to all possible paths in a lookahead
horizon of $H$ time-steps ahead. That is, it finds $a^t$, the first
action in the vector $\mathbf{a} = (a^t, \ldots, a^{t+H})$, where

$$\mathbf{a} = \arg\max_{\mathbf{a}' \in A^H} \sum_{t'=t}^{t+H} R\big(s^{t'}, a^{t'}, f(s^{t'}, a^{t'})\big).$$

Algorithm 2.1 $(a_m, V_m) = findBestAction(s^t, H)$
if $H == 0$ or $t == T - 1$ then
    return $(0, 0)$
end if
$V_m = -\infty$, $a_m = 0$
for all $a \in A$ do
    $s^{t+1} = f(s^t, a)$
    $(a^{t+1}, V^{t+1}) = findBestAction(s^{t+1}, H - 1)$
    $V^t = R(s^t, a, s^{t+1}) + V^{t+1}$
    if $V^t > V_m$ then
        $a_m = a$, $V_m = V^t$
    end if
end for
return $(a_m, V_m)$

2) Algorithm 2.2: The second component of our tree search
algorithm finds the optimal lookahead policy by invoking
$findBestAction$ at each time step from $t = 0$ to $t = T - 1$.

Algorithm 2.2 $\pi = treeSearch(s^0, H)$
for $t = 0$ to $T - 1$ do
    $(a_m, V_m) = findBestAction(s^t, H)$
    $\pi(s^t) = a_m$
    $s^{t+1} = f(s^t, a_m)$
end for
return $\pi$
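
Algorithms 2.1 and 2.2 translate almost line by line into code. The sketch below keeps the recursion generic: the transition function `f`, the reward `R`, the action list and the toy problem at the bottom are placeholders chosen for illustration, and passing the time index explicitly is an implementation choice of this sketch.

```python
def find_best_action(s, t, H, T, actions, f, R):
    """Best first action and its value over a lookahead of H steps."""
    if H == 0 or t == T - 1:
        return None, 0.0
    best_a, best_v = None, float("-inf")
    for a in actions:
        s_next = f(s, a)
        _, v_next = find_best_action(s_next, t + 1, H - 1, T, actions, f, R)
        v = R(s, a, s_next) + v_next
        if v > best_v:
            best_a, best_v = a, v
    return best_a, best_v

def tree_search(s0, H, T, actions, f, R):
    """Roll the lookahead policy forward from s0.

    Returns the visited (state, action) pairs. The loop stops at T - 2
    because the recursion returns immediately at t = T - 1.
    """
    policy, s = [], s0
    for t in range(T - 1):
        a, _ = find_best_action(s, t, H, T, actions, f, R)
        policy.append((s, a))
        s = f(s, a)
    return policy

# Toy problem (illustrative only): the state is an integer, actions move it
# by -1 or +1, and the reward penalises the distance of the next state from 3.
def f(s, a):
    return s + a

def R(s, a, s_next):
    return -abs(s_next - 3)

plan = tree_search(s0=0, H=2, T=6, actions=(-1, 1), f=f, R=R)
```

Each call to `find_best_action` visits on the order of $|A|^H$ paths, which is the exponential dependence on $H$ that motivates keeping the horizon short.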



3) Improve by Sub-sampling: We can take advantage of
a certain property of this problem: it is very unlikely that
in "good" (highly rewarding) paths, subsequent actions will
differ significantly from each other. This is both because of the
high start-up costs of generators, and because of the minimal
up/down time limitation (rapidly switching different machines
on/off can lead to infeasible states, where there aren't enough
available generators to satisfy demand).

Exploiting this property, we added an improvement to our
algorithm. Instead of iterating through all actions when
searching for the best one at time $t + 1$, we only sample small
deviations from the last best action at time $t$ (denoted as $a^{t*}$). For
the sampling we use a probability density over the actions
with an inverse relation to $\|a^{t*} - a^{t+1}\|_2$. This improvement
significantly reduces the runtime, and can also enable setting
a larger value for $H$. In the experiments section, we test the
usability of our improvement and compare it to the original
approach.
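
One simple way to realise such a sampling scheme is to flip each bit of the previous best action independently with a small probability, so that actions far from $a^{t*}$ become rapidly less likely. This is an illustrative stand-in, not the density used in the paper; the flip probability and the sample count are hypothetical parameters.

```python
import random

def sample_near(last_best, n_samples, flip_prob=0.1, rng=None):
    """Sample binary actions close to last_best.

    Each bit is flipped independently with probability flip_prob, so an
    action at Hamming distance d from last_best is drawn with probability
    flip_prob**d * (1 - flip_prob)**(len(last_best) - d). For binary
    vectors the Hamming distance equals the squared Euclidean distance.
    """
    rng = rng or random.Random()
    candidates = {tuple(last_best)}      # always keep the previous action
    for _ in range(n_samples):
        candidates.add(tuple(b ^ 1 if rng.random() < flip_prob else b
                             for b in last_best))
    return sorted(candidates)

candidates = sample_near((1, 1, 0, 0, 1), n_samples=8, rng=random.Random(0))
```

The tree search would then loop over `candidates` instead of over the full action set, trading completeness for a much smaller branching factor.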
































C. Algorithm 3 - Back Sweep

Our "Back Sweep" algorithm is a novel algorithm, inspired
by the dynamic programming concept of backtracking from
the terminal time and going backwards. That way, we have a
reliable estimation of the value of future states, and can base
decisions on that knowledge of future values.
The main novelty is in sampling 'interesting' (potentially
beneficial) areas of the state-space, and using a nearest neighbour
(NN) approximation of them in the Bellman update step
(defined below).

1) Algorithm 3.1: The first of the two parts of the algorithm is
the evaluation of the optimal value of each sampled state, $v^*(s)$. The
optimal value is the value of a state when using the optimal
policy as defined in (8):

$$v^*(s) = \sup_{\pi \in \Pi} v^{\pi}(s). \tag{11}$$

It is found using Bellman's update step, which lies at the heart
of the algorithm:

$$v^*(s) = \max_{a \in A} \big[ R(s, a, s') + v^*(s') \big]. \tag{12}$$

Algorithm 3.1 $\mathcal{D} = evaluateStates(N_s, s_0)$
Initialize $\mathcal{D} = \emptyset$, $s = s_0$
for $t = T - 1$ to $0$ do
    $S^t = sampleEnvironment(s, N_s)$
    for $i = 1$ to $N_s$ do
        $v^{*t}(s_i^t) = \max_{a \in A} \big[ R(s_i^t, a, f(s_i^t, a)) + v^{*t+1}(NN(f(s_i^t, a), \mathcal{D})) \big]$
    end for
    $\mathcal{D} = \mathcal{D} \cup \{(S^t, v^{*t})\}$
    $s = \arg\max_{s} v^{*t}(s)$
end for
return $\mathcal{D}$

• $sampleEnvironment(s, N_s)$ returns $N_s$ samples of
states that are 'close' to $s$. Closeness is quantified using
a metric we defined.

• $NN(s, \mathcal{D})$ returns the nearest-neighbor state from all
states that are in the $(s, v)$ pairs in $\mathcal{D}$.

2) Algorithm 3.2: The second part of the algorithm
produces a greedy policy via one quick sweep forward, beginning
from the initial state $s^0$. This policy is greedy since at
each step we choose the best possible action, and it is proven
that for exact $v^*$ values it is also the optimal policy [?].

Algorithm 3.2 $\pi = findGreedyPolicy(\mathcal{D})$
for $t = 0$ to $T - 1$ do
    $a^t = \arg\max_{a \in A} \big[ R(s^t, a, f(s^t, a)) + v^{*t+1}(NN(f(s^t, a), \mathcal{D})) \big]$
    $\pi(s^t) = a^t$
    $s^{t+1} = f(s^t, a^t)$
end for
return $\pi$

VI. Experiments

In order to test the performance of the three proposed
algorithms, we used Matlab [22] to implement and run them on
a problem setting with $N = 12$ generators and a 24-hour schedule
($T = 24$), with parameters taken from [?]. In [1], an Adaptive
Simulated Annealing (SA) technique is used, and a minimal
objective of \$644,951 is achieved.

A. Algorithm 1 

Algorithm 1 only performed well on a smaller setting of
the problem ($N = 8$, $T = 12$) and was not included in Table I.
In spite of that, we chose to present algorithm 1 in this
paper as a baseline, since API is a very commonly used algorithm
in the reinforcement learning literature. On top of that, the
policy structure we devised enabled the algorithm to leap
from performing only on an $N = 4$, $T = 8$ setting to the
$N = 8$, $T = 12$ setting.

We also find value in understanding its weaknesses: it could
not handle a larger scheme due to its forward-looking mechanism.
Since it starts with a random policy, state evaluation is
very poor at the beginning (compared to the optimal values),
and the improvement becomes slow and inefficient throughout
the iterations. This problem is magnified since,
unlike in infinite-horizon formulations, different policies are
used for different time-steps.

B. Algorithm 2 

Algorithm 2 was tested with two different lookahead horizons:
$H = 1$ and $H = 3$. The extremely low run-time for
$H = 1$ makes it the most preferable algorithm for this problem,
in spite of the small increase in objective cost.

The large difference in run-times for the two cases is due to
the exponential complexity in $H$.

The improved version of algorithm 2, which includes
sub-sampling of actions, enables a reduction in run-time while
sacrificing negligible value in the overall cost.

C. Algorithm 3 

The terminal state of algorithm 2's solution is fed as an
initial state to algorithm 3. The sampling count used was $N_s = 50$.
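
To make the roles of the seed state and the sampling count $N_s$ concrete, the backward evaluation and the forward greedy sweep of Algorithm 3 can be sketched on a toy deterministic problem. Everything problem-specific here (the integer state, `sample_env`, the distance `dist`, the reward) is a placeholder invented for illustration; the paper's own sampling metric is not reproduced.

```python
def nn_value(s, table, dist):
    """Value of the stored state nearest to s (0 if nothing is stored)."""
    if not table:
        return 0.0
    return table[min(table, key=lambda x: dist(x, s))]

def evaluate_states(seed, T, n_samples, actions, f, R, sample_env, dist):
    """Backward sweep: values[t] maps sampled states at time t to values."""
    values = {T: {}}                  # nothing is collected after the last step
    s = seed
    for t in range(T - 1, -1, -1):
        values[t] = {}
        for si in sample_env(s, n_samples):
            values[t][si] = max(
                R(si, a, f(si, a)) + nn_value(f(si, a), values[t + 1], dist)
                for a in actions)
        s = max(values[t], key=values[t].get)   # best sample seeds step t - 1
    return values

def find_greedy_policy(s0, T, values, actions, f, R, dist):
    """Forward sweep: act greedily with respect to the stored values."""
    policy, s = [], s0
    for t in range(T):
        a = max(actions, key=lambda a: R(s, a, f(s, a))
                + nn_value(f(s, a), values[t + 1], dist))
        policy.append((s, a))
        s = f(s, a)
    return policy

# Toy problem (illustrative only).
def f(s, a):
    return s + a

def R(s, a, s_next):
    return -abs(s_next - 3)

def sample_env(s, n):
    return [s + k - n // 2 for k in range(n)]   # n integers centred on s

def dist(x, y):
    return abs(x - y)

values = evaluate_states(3, 4, 5, (-1, 1), f, R, sample_env, dist)
policy = find_greedy_policy(0, 4, values, (-1, 1), f, R, dist)
```

A larger sampling count gives the nearest-neighbour lookup more states to choose from at each time step, at the cost of more Bellman updates per step.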

D. Result Comparison 


TABLE I

Experiment results of the different algorithms

Algorithm                       Objective cost [$]   Run-time [min]
SA in [?]                       702,379              N/A
Adaptive SA in [1]              644,951              145
Tree Search, H=1                512,850              2.5
Tree Search, H=3                512,217              240
Sub-sampled Tree Search, H=3    512,850              85
Back Sweep                      511,500              60


Table I summarizes the experiments' results. The solutions
we obtain are very similar to each other, all around \$512,000
in operation cost.























Algorithm 2 produces a 27% improvement in objective value
relative to the SA result of \$702,379 in Table I, and a 20%
improvement relative to the adaptive SA algorithm presented in [1],
which achieved a minimal objective of \$644,951. It does so with only
2.5 minutes of running time, compared to about 2.5 hours in [1].
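
The relative improvements implied by Table I can be checked with a few lines of arithmetic. The figures are copied from the table; the variable names are ours.

```python
baselines = {"SA": 702_379, "Adaptive SA": 644_951}   # objective cost [$]
tree_search_h1 = 512_850                              # Tree Search, H=1

for name, baseline in baselines.items():
    saving = (baseline - tree_search_h1) / baseline
    print(f"vs {name}: {100 * saving:.1f}% lower cost")

speed_up = 145 / 2.5   # run-times in minutes: Adaptive SA vs Tree Search, H=1
print(f"run-time ratio vs Adaptive SA: {speed_up:.0f}x")
```

The saving is 27.0% against the plain SA cost and 20.5% against the adaptive SA cost, with a run-time ratio of 58.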

VII. Summary 

In this paper we introduced three algorithms from the field
of reinforcement learning, one of them novel. We formulated
the unit commitment problem as a Markov Decision Process
and solved it using the three algorithms (successfully with
two).

The superior results in the experiments section lead us to
believe that modelling the UC problem as an MDP is highly
advantageous over the other existing methods mentioned
in the introduction.

An additional significant advantage is the option of an
immediate extension to a stochastic environment, which includes
consideration of uncertainties. Demand, generation capacity,
and generation costs can easily be modelled as stochastic
by setting the appropriate probabilistic transition kernel and
reward function in our existing MDP model. The algorithms
presented in the paper need not change to obtain a solution
for such a probabilistic version. This transition to an uncertain
formulation might be very challenging [25], [?], or even
impossible, when using other optimization methods.

We intend to test our algorithms under such uncertainty
conditions, and possibly to change the formulation in order
to obtain a risk-averse strategy for the unit commitment. This
can be done by including a risk criterion in the objective that
takes into account contingencies, shut-downs, and load
shedding costs while considering the probabilities of those
events.